{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CPSC 330 hw8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions\n",
    "rubric={points:5}\n",
    "\n",
    "Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "rubric={points:5}\n",
    "\n",
    "For this assignment, you can choose any reasonable dataset for supervised learning. Here is a list of Kaggle datasets in CSV format: https://www.kaggle.com/datasets?fileType=csv. You can also pick a dataset from your own hobby project or research project. \n",
    "\n",
    "If it's not already clear from the dataset, you will need to choose which column you are trying to predict. \n",
    "\n",
    "This week, you will build ML models on your dataset. Next week (hw9), you'll write a blog post about your work.\n",
    "\n",
    "Your task here for these 5 points is to describe where you got the data from. If from Kaggle or another online source, poviding a link is sufficient. If from elsewhere, please elaborate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1\n",
    "rubric={points:10}\n",
    "\n",
    "For your dataset: \n",
    "\n",
    "- Describe your goal (e.g. what you are trying to predict).\n",
    "- Describe a (possibly fictional) scenario where someone in the real world is interested in this task and why it is useful to them. \n",
    "- What is the **decision** that ML is going to help make and what are the alternatives?\n",
    "\n",
    "Max 1 paragraph total."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "\n",
    "Build a model for your chosen problem. Your work must involve the following elements (the list is broken up between the sub-parts):\n",
    "\n",
    "- Split your data into train/test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(a)\n",
    "rubric={points:10}\n",
    "\n",
    "- Perform exploratory data analysis, including outlier detection. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(b)\n",
    "rubric={points:5}\n",
    "\n",
    "- Drop features that would not actually be available during deployment.\n",
    "- Drop features for other reasons, e.g. they would not be useful for predicting the target.\n",
    "- Provide a brief justification explaining the features you dropped."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(c)\n",
    "rubric={points:10}\n",
    "\n",
    "- Build a preprocessing pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(d)\n",
    "rubric={points:5}\n",
    "\n",
    "- Try some models includng, at least:\n",
    "    1. `DummyRegressor` or `DummyClassifer`, as appropriate\n",
    "    2. `Ridge` or `LogisticRegression`, as appropriate\n",
    "    3. `LGBMRegressor` or `LGBMClassifier`, as appropriate \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(e)\n",
    "rubric={points:5}\n",
    "\n",
    "- Hyperparameter tuning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(f)\n",
    "rubric={points:5}\n",
    "\n",
    "- Look at the sub-scores from the different folds of cross-validation and briefly discuss.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(g)\n",
    "rubric={points:5}\n",
    "\n",
    "- Compute at least one relevant scoring metric aside from the default `.score()` and briefly discussing the result.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(h)\n",
    "rubric={points:5}\n",
    "\n",
    "- Evaluate your model on the test set and briefly discuss your level of confidence in whether this is an accurate assessment of your model's future deployment performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3\n",
    "rubric={points:10}\n",
    "\n",
    "The following short-answer questions pertain to Lecture 19 on clustering.\n",
    "\n",
    "1. What's the main difference between unsupervised and supervised learning?\n",
    "2. When choosing $k$ in $k$-means, why not just choose the $k$ that leads to the smallest inertia (sum of squared distances within clusters)?\n",
    "3. You decide to use clustering for _outlier detection_; that is, to detect instances that are very atypical compared to all the rest. How might you do this with $k$-means?\n",
    "4. You decide to use clustering for _outlier detection_; that is, to detect instances that are very atypical compared to all the rest. How might you do this with DBSCAN?\n",
    "5. For hierarchical clustering, we briefly discussed a few different methods for merging clusters: single linkage, average linkage, etc. Why do we have this added complication here - can't we just minimize distance like we did with $k$-means? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission to Canvas\n",
    "\n",
    "**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).\n",
    "\n",
    "When you are ready to submit your assignment do the following:\n",
    "\n",
    "1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.\n",
    "2. Save your notebook.\n",
    "3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.\n",
    "4. Run the code `submit()` below to go through an interactive submission process to Canvas.\n",
    ">For this step, you will need a Canvas *Access Token* token. If you haven't already got one, log-in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. \n",
    "\n",
    "Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canvasutils.submit import submit, convert_notebook\n",
    "\n",
    "# Note: the canvasutils package should have been installed as part of your environment setup - \n",
    "# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert_notebook(\"hw8.ipynb\", \"html\")  # uncomment and run when you want to try convert your notebook to HTML (or you can convert manually from the File menu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# submit(course_code=53561, token=False)  # uncomment and run when ready to submit "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "name": "_merged",
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "438px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
